Aryanto
The goal of this modelling exercise is to predict the sentiment of a given Twitter post using Python. Sentiment analysis can predict many different emotions attached to a text, but in this report only the three main ones are considered: positive, negative, and neutral. The training dataset is small (just over 5900 examples) and the data in it is highly skewed, which makes building a good classifier considerably harder. After creating many custom features, making use of bag-of-words and word2vec representations, and applying the Extreme Gradient Boosting algorithm, a classification accuracy of 58% was achieved.
Data was pre-processed using pandas, gensim and numpy libraries and the learning/validating process was built with scikit-learn. Plots were created using plotly.
from collections import Counter
import nltk
import pandas as pd
import re as regex
import numpy as np
import plotly
from plotly import graph_objs
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from time import time
import gensim
# plotly configuration
plotly.offline.init_notebook_mode()
This report was first prepared as a classical Python project using object-oriented programming, with maintainability in mind. To present it as a Jupyter Notebook, the classes had to be split into multiple code cells. To do so, each class is suffixed with a _PurposeOfThisSnippet name, and the classes inherit from one another. The final class is then run and the results are shown.
The input data consisted of two CSV files:
train.csv (5971 tweets) for training and test.csv (4000 tweets) for testing.
The data had the following format (the test data did not contain the Category column):
| Id | Category | Tweet |
|---|---|---|
| 635930169241374720 | neutral | IOS 9 App Transport Security. Mm need to check if my 3rd party network pod supports it |
All tweets are in English, which simplifies the processing and analysis.
class TwitterData_Initialize():
data = []
processed_data = []
wordlist = []
data_model = None
data_labels = None
is_testing = False
def initialize(self, csv_file, is_testing_set=False, from_cached=None):
if from_cached is not None:
self.data_model = pd.read_csv(from_cached)
return
self.is_testing = is_testing_set
if not is_testing_set:
self.data = pd.read_csv(csv_file, header=0, names=["id", "emotion", "text"])
self.data = self.data[self.data["emotion"].isin(["positive", "negative", "neutral"])]
else:
self.data = pd.read_csv(csv_file, header=0, names=["id", "text"],dtype={"id":"int64","text":"str"},nrows=4000)
not_null_text = ~pd.isnull(self.data["text"])
not_null_id = ~pd.isnull(self.data["id"])
self.data = self.data.loc[not_null_id & not_null_text, :]
self.processed_data = self.data
self.wordlist = []
self.data_model = None
self.data_labels = None
The code snippet above loads the data from the given file for further processing, or reads an already preprocessed file from the cache.
There is also a distinction between processing the testing and the training data: since the test.csv file contained many empty entries, those were removed.
Additional class properties such as data_model and wordlist will be used later.
data = TwitterData_Initialize()
data.initialize("..\\Dataset\\train.csv")
data.processed_data.head(5)
| | id | emotion | text |
|---|---|---|---|
| 0 | 635769805279248384 | negative | Not Available |
| 1 | 635930169241374720 | neutral | IOS 9 App Transport Security. Mm need to check... |
| 2 | 635950258682523648 | neutral | Mar if you have an iOS device, you should down... |
| 3 | 636030803433009153 | negative | @jimmie_vanagon my phone does not run on lates... |
| 4 | 636100906224848896 | positive | Not sure how to start your publication on iOS?... |
The first thing that can be done as soon as the data is loaded is to look at the class distribution. The training set had the following distribution:
df = data.processed_data
neg = len(df[df["emotion"] == "negative"])
pos = len(df[df["emotion"] == "positive"])
neu = len(df[df["emotion"] == "neutral"])
dist = [
graph_objs.Bar(
x=["negative","neutral","positive"],
y=[neg, neu, pos],
)]
plotly.offline.iplot({"data":dist, "layout":graph_objs.Layout(title="Sentiment type distribution in training set")})
The target of the following preprocessing is to create a bag-of-words representation of the data. The steps execute as follows:
- remove URLs
- remove usernames (mentions)
- remove tweets marked "Not Available"
- remove special characters (this also unrolls hashtags into normal words)
- remove numbers

For the purpose of cleansing, the TwitterCleanuper class was created. It consists of methods that carry out all of the tasks shown in the list above, most of them using regular expressions.
The class exposes its interface through the iterate() method, which yields every cleanup method in the proper order.
class TwitterCleanuper:
def iterate(self):
for cleanup_method in [self.remove_urls,
self.remove_usernames,
self.remove_na,
self.remove_special_chars,
self.remove_numbers]:
yield cleanup_method
@staticmethod
def remove_by_regex(tweets, regexp):
tweets.loc[:, "text"] = tweets.loc[:, "text"].str.replace(regexp, "", regex=True)
return tweets
def remove_urls(self, tweets):
return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r"http.?://[^\s]+[\s]?"))
def remove_na(self, tweets):
return tweets[tweets["text"] != "Not Available"]
def remove_special_chars(self, tweets): # it unrolls the hashtags to normal words
for remove in map(lambda r: regex.compile(regex.escape(r)), [",", ":", "\"", "=", "&", ";", "%", "$",
"@", "%", "^", "*", "(", ")", "{", "}",
"[", "]", "|", "/", "\\", ">", "<", "-",
"!", "?", ".", "'",
"--", "---", "#"]):
tweets.loc[:, "text"] = tweets.loc[:, "text"].str.replace(remove, "", regex=True)
return tweets
def remove_usernames(self, tweets):
return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r"@[^\s]+[\s]?"))
def remove_numbers(self, tweets):
return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r"\s?[0-9]+\.?[0-9]*"))
The loaded tweets can now be cleaned.
class TwitterData_Cleansing(TwitterData_Initialize):
def __init__(self, previous):
self.processed_data = previous.processed_data
def cleanup(self, cleanuper):
t = self.processed_data
for cleanup_method in cleanuper.iterate():
if not self.is_testing:
t = cleanup_method(t)
else:
if cleanup_method.__name__ != "remove_na":
t = cleanup_method(t)
self.processed_data = t
data = TwitterData_Cleansing(data)
data.cleanup(TwitterCleanuper())
data.processed_data.head(5)
| | id | emotion | text |
|---|---|---|---|
| 1 | 635930169241374720 | neutral | IOS App Transport Security Mm need to check if... |
| 2 | 635950258682523648 | neutral | Mar if you have an iOS device you should downl... |
| 3 | 636030803433009153 | negative | my phone does not run on latest IOS which may ... |
| 4 | 636100906224848896 | positive | Not sure how to start your publication on iOS ... |
| 5 | 636176272947744772 | neutral | Two Dollar Tuesday is here with Forklift Quick... |
For the text processing, the nltk library is used. First, the tweets are tokenized using nltk.word_tokenize, and then stemming is done with PorterStemmer, as the tweets are entirely in English.
class TwitterData_TokenStem(TwitterData_Cleansing):
def __init__(self, previous):
self.processed_data = previous.processed_data
def stem(self, stemmer=nltk.PorterStemmer()):
def stem_and_join(row):
row["text"] = list(map(lambda str: stemmer.stem(str.lower()), row["text"]))
return row
self.processed_data = self.processed_data.apply(stem_and_join, axis=1)
def tokenize(self, tokenizer=nltk.word_tokenize):
def tokenize_row(row):
row["text"] = tokenizer(row["text"])
row["tokenized_text"] = [] + row["text"]
return row
self.processed_data = self.processed_data.apply(tokenize_row, axis=1)
data = TwitterData_TokenStem(data)
data.tokenize()
data.stem()
data.processed_data.head(5)
| | id | emotion | text | tokenized_text |
|---|---|---|---|---|
| 1 | 635930169241374720 | neutral | [io, app, transport, secur, mm, need, to, chec... | [IOS, App, Transport, Security, Mm, need, to, ... |
| 2 | 635950258682523648 | neutral | [mar, if, you, have, an, io, devic, you, shoul... | [Mar, if, you, have, an, iOS, device, you, sho... |
| 3 | 636030803433009153 | negative | [my, phone, doe, not, run, on, latest, io, whi... | [my, phone, does, not, run, on, latest, IOS, w... |
| 4 | 636100906224848896 | positive | [not, sure, how, to, start, your, public, on, ... | [Not, sure, how, to, start, your, publication,... |
| 5 | 636176272947744772 | neutral | [two, dollar, tuesday, is, here, with, forklif... | [Two, Dollar, Tuesday, is, here, with, Forklif... |
The wordlist (dictionary) is built by simply counting the occurrences of every unique word across the whole training dataset.
Before building the final wordlist for the model, let's take a look at the unfiltered version:
words = Counter()
for idx in data.processed_data.index:
words.update(data.processed_data.loc[idx, "text"])
words.most_common(5)
[('the', 3744), ('to', 2477), ('i', 1667), ('a', 1620), ('on', 1557)]
The most common words are, as expected, the typical English stopwords. We will filter them out; however, since the purpose of this analysis is to determine sentiment, words like "not" and "n't" can influence it greatly. With this in mind, these two words are whitelisted.
stopwords=nltk.corpus.stopwords.words("english")
whitelist = ["n't", "not"]
for idx, stop_word in enumerate(stopwords):
if stop_word not in whitelist:
del words[stop_word]
words.most_common(5)
[('may', 1027), ('tomorrow', 764), ('day', 526), ('go', 499), ('thi', 495)]
Still, some words seem to occur too many times, so let's filter those as well. After some analysis, the lower bound was set to 3 occurrences (with an upper bound of 500).
The wordlist is also saved to a CSV file, so the same words can be reused for the testing set.
class TwitterData_Wordlist(TwitterData_TokenStem):
def __init__(self, previous):
self.processed_data = previous.processed_data
whitelist = ["n't","not"]
wordlist = []
def build_wordlist(self, min_occurrences=3, max_occurences=500, stopwords=nltk.corpus.stopwords.words("english"),
whitelist=None):
self.wordlist = []
whitelist = self.whitelist if whitelist is None else whitelist
import os
if os.path.isfile("..\\Dataset\\wordlist.csv"):
word_df = pd.read_csv("..\\Dataset\\wordlist.csv")
word_df = word_df[word_df["occurrences"] > min_occurrences]
self.wordlist = list(word_df.loc[:, "word"])
return
words = Counter()
for idx in self.processed_data.index:
words.update(self.processed_data.loc[idx, "text"])
for idx, stop_word in enumerate(stopwords):
if stop_word not in whitelist:
del words[stop_word]
word_df = pd.DataFrame(data={"word": [k for k, v in words.most_common() if min_occurrences < v < max_occurences],
"occurrences": [v for k, v in words.most_common() if min_occurrences < v < max_occurences]},
columns=["word", "occurrences"])
word_df.to_csv("..\\Dataset\\wordlist.csv", index_label="idx")
self.wordlist = [k for k, v in words.most_common() if min_occurrences < v < max_occurences]
data = TwitterData_Wordlist(data)
data.build_wordlist()
words = pd.read_csv("..\\Dataset\\wordlist.csv")
x_words = list(words.loc[0:10,"word"])
x_words.reverse()
y_occ = list(words.loc[0:10,"occurrences"])
y_occ.reverse()
dist = [
graph_objs.Bar(
x=y_occ,
y=x_words,
orientation="h"
)]
plotly.offline.iplot({"data":dist, "layout":graph_objs.Layout(title="Top words in built wordlist")})
The data is now ready to be transformed into the bag-of-words representation.
class TwitterData_BagOfWords(TwitterData_Wordlist):
def __init__(self, previous):
self.processed_data = previous.processed_data
self.wordlist = previous.wordlist
def build_data_model(self):
label_column = []
if not self.is_testing:
label_column = ["label"]
columns = label_column + list(
map(lambda w: w + "_bow",self.wordlist))
labels = []
rows = []
for idx in self.processed_data.index:
current_row = []
if not self.is_testing:
# add label
current_label = self.processed_data.loc[idx, "emotion"]
labels.append(current_label)
current_row.append(current_label)
# add bag-of-words
tokens = set(self.processed_data.loc[idx, "text"])
for _, word in enumerate(self.wordlist):
current_row.append(1 if word in tokens else 0)
rows.append(current_row)
self.data_model = pd.DataFrame(rows, columns=columns)
self.data_labels = pd.Series(labels)
return self.data_model, self.data_labels
Let's take a look at the data and see which words are most common for particular sentiments.
data = TwitterData_BagOfWords(data)
bow, labels = data.build_data_model()
bow.head(5)
| | label | go_bow | thi_bow | wa_bow | not_bow | im_bow | see_bow | time_bow | get_bow | like_bow | ... | leadership_bow | snp_bow | tsiprass_bow | parliamentari_bow | alexi_bow | farag_bow | girlfriend_bow | castl_bow | crasher_bow | fiddl_bow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | neutral | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | neutral | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | negative | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | positive | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | neutral | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2185 columns
grouped = bow.groupby(["label"]).sum()
words_to_visualize = []
sentiments = ["positive","negative","neutral"]
#get the most 7 common words for every sentiment
for sentiment in sentiments:
words = grouped.loc[sentiment,:]
words.sort_values(inplace=True,ascending=False)
for w in words.index[:7]:
if w not in words_to_visualize:
words_to_visualize.append(w)
#visualize it
plot_data = []
for sentiment in sentiments:
plot_data.append(graph_objs.Bar(
x = [w.split("_")[0] for w in words_to_visualize],
y = [grouped.loc[sentiment,w] for w in words_to_visualize],
name = sentiment
))
plotly.offline.iplot({
"data":plot_data,
"layout":graph_objs.Layout(title="Most common words across sentiments")
})
Some of the most common words show high distinction between classes, like go and see, while others occur in similar amounts for every class (plan, obama).
None of the most common words is unique to the negative class. At this point, it's clear that the skewed data distribution will be a problem in distinguishing negative tweets from the others.
First of all, let's establish a seed for the random number generators.
import random
seed = 666
random.seed(seed)
The following utility function will train the classifier and show the F1, precision, recall and accuracy scores.
def test_classifier(X_train, y_train, X_test, y_test, classifier):
log("")
log("===============================================")
classifier_name = str(type(classifier).__name__)
log("Testing " + classifier_name)
now = time()
list_of_labels = sorted(list(set(y_train)))
model = classifier.fit(X_train, y_train)
log("Learning time {0}s".format(time() - now))
now = time()
predictions = model.predict(X_test)
log("Predicting time {0}s".format(time() - now))
precision = precision_score(y_test, predictions, average=None, labels=list_of_labels)
recall = recall_score(y_test, predictions, average=None, labels=list_of_labels)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average=None, labels=list_of_labels)
log("=================== Results ===================")
log(" Negative Neutral Positive")
log("F1 " + str(f1))
log("Precision" + str(precision))
log("Recall " + str(recall))
log("Accuracy " + str(accuracy))
log("===============================================")
return precision, recall, accuracy, f1
def log(x):
#can be used to write to log file
print(x)
It is interesting to see what kind of results we can get from such a simple model. The bag-of-words representation is binary, so the Bernoulli Naive Bayes classifier is a natural algorithm to start the experiments with.
The experiment is based on a 7:3 stratified train:test split.
from sklearn.naive_bayes import BernoulliNB
X_train, X_test, y_train, y_test = train_test_split(bow.iloc[:, 1:], bow.iloc[:, 0],
train_size=0.7, stratify=bow.iloc[:, 0],
random_state=seed)
precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, BernoulliNB())
===============================================
Testing BernoulliNB
Learning time 0.7945945262908936s
Predicting time 0.3211078643798828s
=================== Results ===================
Negative Neutral Positive
F1 [0.38949672 0.45214221 0.71411765]
Precision[0.45408163 0.4853229 0.65978261]
Recall [0.34099617 0.42320819 0.77820513]
Accuracy 0.5802089735709896
===============================================
An accuracy of 58% seems quite good for an algorithm as basic as Naive Bayes (bearing in mind that a random classifier would yield around 33% accuracy). However, this performance may not hold for the final testing set. To see how Naive Bayes performs in more general cases, 8-fold cross-validation is used (8 folds were chosen to make the best use of the 8 cores of the testing machine).
def cv(classifier, X_train, y_train):
log("===============================================")
classifier_name = str(type(classifier).__name__)
now = time()
log("Crossvalidating " + classifier_name + "...")
accuracy = [cross_val_score(classifier, X_train, y_train, cv=8, n_jobs=-1)]
log("Cross-validation completed in {0}s".format(time() - now))
log("Accuracy: " + str(accuracy[0]))
log("Average accuracy: " + str(np.array(accuracy[0]).mean()))
log("===============================================")
return accuracy
nb_acc = cv(BernoulliNB(), bow.iloc[:,1:], bow.iloc[:,0])
===============================================
Crossvalidating BernoulliNB...
Cross-validation completed in 8.039513111114502s
Accuracy: [0.54719764 0.48820059 0.28318584 0.31268437 0.34218289 0.50369276 0.47267356 0.53028065]
Average accuracy: 0.43501228742107945
===============================================
This result no longer looks optimistic: for some of the splits, the Naive Bayes classifier performed below the level of a random classifier.
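The "random classifier" baseline quoted earlier can be made concrete with scikit-learn's DummyClassifier. The sketch below uses made-up, skewed labels rather than the actual tweets; note that with imbalanced classes, always predicting the majority class already beats uniform guessing:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, skewed labels mimicking the training set's imbalance
# (the counts here are illustrative, not the real dataset):
rng = np.random.RandomState(666)
y = np.array(["positive"] * 300 + ["neutral"] * 250 + ["negative"] * 100)
X = rng.rand(len(y), 5)  # the features are irrelevant to a dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
# Always predicting the majority class scores 300/650 ~ 46% here,
# so a real model has to beat this, not just the 33% uniform guess.
print(baseline.score(X, y))
```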
Before pushing any other algorithm to its limits on the current data model, let's try to add some features that might help to classify the tweets.
Common sense suggests that special characters like exclamation marks, as well as letter casing, might be important for determining sentiment. The following features will be added to the data model:
| Feature name | Explanation |
|---|---|
| Number of uppercase | people tend to express with either positive or negative emotions by using A LOT OF UPPERCASE WORDS |
| Number of ! | exclamation marks are likely to increase the strength of opinion |
| Number of ? | might distinguish neutral tweets seeking information |
| Number of positive emoticons | positive emoticons will most likely not occur in negative tweets |
| Number of negative emoticons | the inverse of the one above |
| Number of ... | commonly used in commenting something |
| Number of quotations | same as above |
| Number of mentions | sometimes people put a lot of mentions on positive tweets, to share something good |
| Number of hashtags | just for the experiment |
| Number of urls | similar to the number of mentions |
Extraction of these features must be done before any preprocessing happens.
For detecting emoticons, the EmoticonDetector class is created. The file emoticons.txt contains the lists of positive and negative emoticons that are used.
class EmoticonDetector:
emoticons = {}
def __init__(self, emoticon_file="..\\Dataset\\emoticons.txt"):
from pathlib import Path
content = Path(emoticon_file).read_text()
positive = True
for line in content.split("\n"):
if not line.strip():
continue  # skip blank lines so they are not stored as emoticons
if "positive" in line.lower():
positive = True
continue
elif "negative" in line.lower():
positive = False
continue
self.emoticons[line] = positive
def is_positive(self, emoticon):
if emoticon in self.emoticons:
return self.emoticons[emoticon]
return False
def is_emoticon(self, to_check):
return to_check in self.emoticons
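The contents of emoticons.txt are not shown in this report. Judging from the parser above, which switches sections on lines containing "positive" or "negative", the file presumably looks something like this (the emoticons listed here are illustrative):

```text
positive
:)
:-)
:D
=)
negative
:(
:-(
=(
```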
class TwitterData_ExtraFeatures(TwitterData_Wordlist):
def __init__(self):
pass
def build_data_model(self):
extra_columns = [col for col in self.processed_data.columns if col.startswith("number_of")]
label_column = []
if not self.is_testing:
label_column = ["label"]
columns = label_column + extra_columns + list(
map(lambda w: w + "_bow",self.wordlist))
labels = []
rows = []
for idx in self.processed_data.index:
current_row = []
if not self.is_testing:
# add label
current_label = self.processed_data.loc[idx, "emotion"]
labels.append(current_label)
current_row.append(current_label)
for _, col in enumerate(extra_columns):
current_row.append(self.processed_data.loc[idx, col])
# add bag-of-words
tokens = set(self.processed_data.loc[idx, "text"])
for _, word in enumerate(self.wordlist):
current_row.append(1 if word in tokens else 0)
rows.append(current_row)
self.data_model = pd.DataFrame(rows, columns=columns)
self.data_labels = pd.Series(labels)
return self.data_model, self.data_labels
def build_features(self):
def count_by_lambda(expression, word_array):
return len(list(filter(expression, word_array)))
def count_occurences(character, word_array):
counter = 0
for j, word in enumerate(word_array):
for char in word:
if char == character:
counter += 1
return counter
def count_by_regex(regex, plain_text):
return len(regex.findall(plain_text))
self.add_column("splitted_text", map(lambda txt: txt.split(" "), self.processed_data["text"]))
# number of uppercase words
uppercase = list(map(lambda txt: count_by_lambda(lambda word: word == word.upper(), txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_uppercase", uppercase)
# number of !
exclamations = list(map(lambda txt: count_occurences("!", txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_exclamation", exclamations)
# number of ?
questions = list(map(lambda txt: count_occurences("?", txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_question", questions)
# number of ...
ellipsis = list(map(lambda txt: count_by_regex(regex.compile(r"\.\s?\.\s?\."), txt),
self.processed_data["text"]))
self.add_column("number_of_ellipsis", ellipsis)
# number of hashtags
hashtags = list(map(lambda txt: count_occurences("#", txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_hashtags", hashtags)
# number of mentions
mentions = list(map(lambda txt: count_occurences("@", txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_mentions", mentions)
# number of quotes
quotes = list(map(lambda plain_text: int(count_occurences("'", [plain_text.strip("'").strip('"')]) / 2 +
count_occurences('"', [plain_text.strip("'").strip('"')]) / 2),
self.processed_data["text"]))
self.add_column("number_of_quotes", quotes)
# number of urls
urls = list(map(lambda txt: count_by_regex(regex.compile(r"http.?://[^\s]+[\s]?"), txt),
self.processed_data["text"]))
self.add_column("number_of_urls", urls)
# number of positive emoticons
ed = EmoticonDetector()
positive_emo = list(
map(lambda txt: count_by_lambda(lambda word: ed.is_emoticon(word) and ed.is_positive(word), txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_positive_emo", positive_emo)
# number of negative emoticons
negative_emo = list(map(
lambda txt: count_by_lambda(lambda word: ed.is_emoticon(word) and not ed.is_positive(word), txt),
self.processed_data["splitted_text"]))
self.add_column("number_of_negative_emo", negative_emo)
def add_column(self, column_name, column_content):
self.processed_data.loc[:, column_name] = pd.Series(column_content, index=self.processed_data.index)
data = TwitterData_ExtraFeatures()
data.initialize("..\\Dataset\\train.csv")
data.build_features()
data.cleanup(TwitterCleanuper())
data.tokenize()
data.stem()
data.build_wordlist()
data_model, labels = data.build_data_model()
data_model.head(5)
| | label | number_of_uppercase | number_of_exclamation | number_of_question | number_of_ellipsis | number_of_hashtags | number_of_mentions | number_of_quotes | number_of_urls | number_of_positive_emo | ... | leadership_bow | snp_bow | tsiprass_bow | parliamentari_bow | alexi_bow | farag_bow | girlfriend_bow | castl_bow | crasher_bow | fiddl_bow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | neutral | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | neutral | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | negative | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | positive | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | neutral | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2195 columns
Let's see how some of the extra features separate the data set. Some of them, e.g. the number of exclamation marks and the numbers of positive/negative emoticons, do this really well. Despite the good separation, these features sometimes occur only in a small subset of the training dataset.
sentiments = ["positive","negative","neutral"]
plots_data_ef = []
for what in map(lambda o: "number_of_"+o,["positive_emo","negative_emo","exclamation","hashtags","question"]):
ef_grouped = data_model[data_model[what]>=1].groupby(["label"]).count()
plots_data_ef.append({"data":[graph_objs.Bar(
x = sentiments,
y = [ef_grouped.loc[s].iloc[0] for s in sentiments],
)], "title":"How feature \""+what+"\" separates the tweets"})
for plot_data_ef in plots_data_ef:
plotly.offline.iplot({
"data":plot_data_ef["data"],
"layout":graph_objs.Layout(title=plot_data_ef["title"])
})
As a second attempt at classification, a Random Forest will be used.
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(data_model.iloc[:, 1:], data_model.iloc[:, 0],
train_size=0.7, stratify=data_model.iloc[:, 0],
random_state=seed)
precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, RandomForestClassifier(random_state=seed,n_estimators=403,n_jobs=-1))
===============================================
Testing RandomForestClassifier
Learning time 10.748103857040405s
Predicting time 0.7178220748901367s
=================== Results ===================
Negative Neutral Positive
F1 [0.24362606 0.47313692 0.69605037]
Precision[0.4673913 0.4806338 0.62874871]
Recall [0.16475096 0.46587031 0.77948718]
Accuracy 0.5679164105716041
===============================================
The accuracy for the initial split was lower than that of Naive Bayes, but let's see what happens during cross-validation:
rf_acc = cv(RandomForestClassifier(n_estimators=403,n_jobs=-1, random_state=seed),data_model.iloc[:, 1:], data_model.iloc[:, 0])
===============================================
Crossvalidating RandomForestClassifier...
Cross-validation completed in 103.75729584693909s
Accuracy: [0.55309735 0.49115044 0.39233038 0.2920354 0.35988201 0.5140325 0.51846381 0.56129985]
Average accuracy: 0.4602864668435706
===============================================
It looks better; however, it is still not much above the accuracy of a random classifier, and only barely better than the Naive Bayes classifier.
We can also observe the low recall of the Random Forest classifier on the negative class, which is likely caused by the data skewness.
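One mitigation worth trying (not part of the original experiment) is to reweight the classes inversely to their frequency, which raises the penalty for misclassifying the rare negative class. scikit-learn's RandomForestClassifier supports this through the class_weight parameter; the data below is an illustrative stand-in for the real feature matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative skewed data -- not the real tweet features.
rng = np.random.RandomState(666)
X = rng.rand(600, 10)
y = np.array(["neutral"] * 350 + ["positive"] * 200 + ["negative"] * 50)

# class_weight="balanced" weights each class by n_samples / (n_classes * count),
# so errors on the rare "negative" class cost more during training.
clf = RandomForestClassifier(n_estimators=50, random_state=666,
                             class_weight="balanced")
clf.fit(X, y)
print(sorted(clf.classes_))
```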
The experiment showed that predicting text sentiment is a non-trivial task for machine learning. A lot of preprocessing is required just to be able to run any algorithm and see the, usually not great, results. The main problem in sentiment analysis is crafting the machine representation of the text. A simple bag-of-words was definitely not enough to obtain satisfying results, so many additional features were created based on common sense (number of emoticons, exclamation marks, etc.). The word2vec representation raised the prediction quality significantly. A slight improvement in classification accuracy for the given training dataset might still be possible, but since the data is highly skewed (with a small number of negative cases), the difference will probably be on the order of a few percent. What could really improve classification results is adding many more examples (increasing the training dataset): the given 5971 examples obviously do not contain every combination of word usage, and many emotion-expressing words are surely missing.